Abstract

The causal assumptions, the study design and the data are the elements required for 
scientific inference in empirical research. The research is adequately communicated 
only if all of these elements and their relations are described precisely. Causal models 
with design describe the study design and the missing data mechanism together with 
the causal structure and allow the direct application of causal calculus and the concept 
of ignorability. The flow of the study is visualized by ordering the nodes of the causal 
diagram in two dimensions by their causal order and the time of the observation. 
Causal models with design offer a systematic and unifying view of scientific inference and
increase the clarity and speed of communication. Examples show graphical models for
a salary survey, a clinical trial, a nested case-control study and a two-stage case-cohort 
study. 

1 Introduction 

Causal models are commonly used to describe the true or hypothesized causal rela- 
tionships between a set of variables. The model is typically presented as a directed 
acyclic graph (DAG), where the nodes represent the variables and the edges represent 
the causal relationship so that the arrow shows the direction of the effect. A graphical 
model serves as a tool for visualizing and discussing causal relationships but even more 
importantly it is a mathematically well-defined object from which causal conclusions
can be drawn in a systematic way. Causal calculus (Pearl, 2009) can be used to estimate
causal effects from observational data provided that the study has been carefully
designed (Rubin, 2008).

Causal models are not sufficient for the estimation of causal effects without the
data. After specifying the causal model and the objectives of the study, the first questions
of the researcher should be "How should the data be collected?" and "How should
the data collection be taken into account in the analysis?" (Heckman, 1979;
Rosenbaum and Rubin, 1983). In many fields of science, the data are not obtained
as a simple random sample of the population. The pressure of cost-efficiency leads 
to complex study designs where the expensive measurements are made only for a
carefully selected subset of individuals (Reilly, 1996; McNamee, 2002; Langholz, 2007;
Kulathinal et al., 2007; Van Gestel et al., 2000; Karvanen et al., 2009). It is therefore
crucial to take the study design into account in the estimation of causal effects. The
increased complexity of study designs also emphasizes the need for accurate and efficient
reporting (von Elm et al., 2007; Vandenbroucke et al., 2007; Schulz et al., 2010;
Moher et al., 2010).



An introduction to causal models with design is given through an example from
survey sampling in Section 2. The formal definition of the concept is then presented in
Section 3 and in Section 4 it is shown how the causal models with design can be utilized
in data analysis. Examples describing a clinical trial, a nested case-control study and
a two-stage case-cohort study as a graphical model are provided in Section 5. Finally,
the benefits, the limitations and the implications of the proposed concept are discussed
in Section 6.



2 Introductory example 

Consider a study that aims to analyze the salary level of university graduates of year 
2010 using survey data. The surveys are carried out in two successive years, 2011 and
2012, which also allows the analysis of the annual changes in the salaries. The purpose
of the example is to show how different study designs and different assumptions on
the non-response show up in a causal model with design. Four designs (a)-(d)
are considered. In design (a), a random sample is selected in the first year and in
the next year the individuals in the same sample are contacted again. Design (b) is
otherwise similar to (a) but the probability of non-response depends on the salary level. 
In design (c), the sampling is done independently for both years and the individuals 
can be linked between the years. In design (d), the sampling is done independently for 
both years but the individuals present in both samples cannot be identified. 

The causal models with design for cases (a)-(d) are shown in Figure 1. Variable
$Y_{1i}$ represents the salary of the individual $i$ in the year 2011 and $Y_{2i}$ in the year
2012. It is assumed that the salary of the first year has a causal effect on the salary
of the second year and therefore the graphical model has a directed edge from $Y_{1i}$ to
$Y_{2i}$. This underlying causal model is naturally the same in all the designs. Variable
$m_{\Omega i}$ represents an indicator for the population $\Omega$, a well-defined closed population
containing all graduates of year 2010. It is defined as $m_{\Omega i} = 1$ for $i \in \Omega$ and $m_{\Omega i} = 0$ for $i \notin \Omega$.
Variables $m_{1i}$ and $m_{2i}$ represent the survey sampling for the years 2011 and 2012,
respectively. These indicator variables have value 1 if the individual was selected to the
sample and 0 otherwise. The arrow from $m_{\Omega i}$ to $m_{1i}$ describes the fact that the sample
is selected from the population, i.e. $m_{1i} = 1$ implies $m_{\Omega i} = 1$. The values of $m_{1i}$ and
$m_{2i}$ can be determined by the researcher, which is shown in the graph by using diamond
symbols for these nodes. Variables $M_{1i}$ and $M_{2i}$ represent the non-response related
to the surveys of years 2011 and 2012, respectively. These indicator variables have
value 1 if the individual responded to the survey and 0 otherwise. Only responses from
the individuals that belong to the sample are accepted, which is described by an arrow
from $m_{1i}$ to $M_{1i}$.

Variables $Y_{1i}$ and $Y_{2i}$ are related to the underlying population and are not directly
observed, which is shown in the visualization with open circles. Instead, the variables
$Y_{1i}^*$ and $Y_{2i}^*$ are measured from the samples of the first and the second year after graduation,
respectively. Because $Y_{1i}^*$ and $Y_{2i}^*$ are observed, they are shown as filled circles.
The value of $Y_{1i}^*$ is $Y_{1i}$ if the individual belongs to the sample and has responded, i.e.
$M_{1i} = 1$; otherwise $Y_{1i}^*$ is not available. This is described in the graph by arrows
from $M_{1i}$ and $Y_{1i}$ to $Y_{1i}^*$. In other words, the causal assumptions, the study design
and the data are all presented in the same graph where the causal effects are defined
consistently regardless of the type of the variable.

In design (a), the sampling (node $m_{1i}$) is done once and the probability of non-response
in both 2011 and 2012 (nodes $M_{1i}$ and $M_{2i}$) is independent of all other
variables except the sampling. In other words, this is a missing completely at
random (Rubin, 1976) situation and adjustments for the missing data are not necessary.
The distributions of $Y_{1i}$ and $Y_{2i}$ can be estimated by the sample distributions
$p(Y_{1i}^* \mid M_{1i} = 1)$ and $p(Y_{2i}^* \mid M_{2i} = 1)$. The effect of $Y_{1i}$ on $Y_{2i}$ can be estimated
by $p(Y_{2i}^* \mid Y_{1i}^*, M_{1i} = 1, M_{2i} = 1)$.

In design (b), the data collection is similar to design (a) but the probability of non-response
depends on the salary level, which is shown by arrows from $Y_{1i}$ to $M_{1i}$ and
from $Y_{2i}$ to $M_{2i}$. This is a missing not at random (Rubin, 1976) situation and proper
estimation of the distributions $p(Y_{1i})$, $p(Y_{2i})$ and $p(Y_{2i} \mid Y_{1i})$ from the sample requires
modeling of the selection probabilities $p(M_{1i} = 1 \mid Y_{1i})$ and $p(M_{2i} = 1 \mid Y_{2i})$.

In design (c), the sampling is carried out separately for each year. Consequently,
there are two selection nodes $m_{1i}$ and $m_{2i}$. The population $m_{\Omega i}$ is known, which means
that the individuals present in both samples can be identified by a unique identifier
such as social security number. However, there is no guarantee that there will be any
overlap between the samples. Therefore, the estimation of the effect of $Y_{1i}$ on $Y_{2i}$ is in
general possible only if the set $\{i \mid M_{1i} = 1, M_{2i} = 1\}$ is large enough to allow the
estimation of $p(Y_{2i}^* \mid Y_{1i}^*, M_{1i} = 1, M_{2i} = 1)$.

Figure 1: Graphical models for different study designs of a salary survey. Causal variables
$Y_{1i}$ and $Y_{2i}$ represent salaries of the individual $i$ in the years 2011 and 2012,
respectively, and $Y_{1i}^*$ and $Y_{2i}^*$ are the corresponding measurements. The causal relationships
are defined with respect to the population $\Omega$. The schemas for the sampling
(nodes $m_{1i}$ and $m_{2i}$) and the non-response (nodes $M_{1i}$ and $M_{2i}$) vary between the
designs.

Design (d) is similar to design (c) in the sense that the sampling is carried out separately
for each year. The difference is that the unique identifier of each individual has
been deleted after the sampling. In other words, the samples cannot be linked to the
population or to each other. In the graph, this is shown by using open diamonds for the
selection nodes $m_{1i}$ and $m_{2i}$. In this design, the distributions of $Y_{1i}$ and $Y_{2i}$ can still be
estimated by the sample distributions $p(Y_{1i}^* \mid M_{1i} = 1)$ and $p(Y_{2i}^* \mid M_{2i} = 1)$. The effect
of $Y_{1i}$ on $Y_{2i}$ cannot be estimated in the general case but can be estimated for certain
parametric models. For instance, assume that a linear model $E(Y_{2i} \mid Y_{1i}) = \beta Y_{1i}$, where
$\beta$ is a regression coefficient, holds for all individuals $i$. The parameter $\beta$ can then be
estimated by $\hat{\beta} = E(Y_2^*)/E(Y_1^*)$. The interpretation is similar to the average causal
effect in the Rubin causal model (Rubin, 1974; Holland, 1986).
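
The contrast between designs (a), (b) and (d) can be illustrated with a small simulation. The sketch below is not part of the original study: the salary distribution, the response probabilities, the value of the coefficient and all function names are hypothetical choices made only for illustration, and sampling and response are merged into a single inclusion probability in designs (a) and (b).

```python
import math
import random


def simulate_population(n=20000, beta=1.05, seed=1):
    """Draw (Y1, Y2) pairs with E(Y2 | Y1) = beta * Y1 (all values hypothetical)."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        y1 = rng.lognormvariate(8.0, 0.3)        # salary in 2011
        y2 = beta * y1 + rng.gauss(0.0, 100.0)   # salary in 2012
        population.append((y1, y2))
    return population, rng


def mean(values):
    return sum(values) / len(values)


def compare_designs(n=20000, beta=1.05, seed=1):
    population, rng = simulate_population(n, beta, seed)
    true_mean_y1 = mean([y1 for y1, _ in population])

    # Design (a): inclusion does not depend on the salary (MCAR).
    observed_a = [y1 for y1, _ in population if rng.random() < 0.6]

    # Design (b): the response probability decreases with the salary (MNAR).
    observed_b = [y1 for y1, _ in population
                  if rng.random() < math.exp(-y1 / 3000.0)]

    # Design (d): two independent samples that cannot be linked, so only the
    # marginal means are available and beta is estimated as their ratio.
    sample_1 = [y1 for y1, _ in population if rng.random() < 0.3]
    sample_2 = [y2 for _, y2 in population if rng.random() < 0.3]
    beta_hat = mean(sample_2) / mean(sample_1)

    return {
        "true_mean_y1": true_mean_y1,
        "mean_a": mean(observed_a),
        "mean_b": mean(observed_b),
        "beta_hat_d": beta_hat,
    }


if __name__ == "__main__":
    for name, value in compare_designs().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the sample mean of design (a) stays close to the population mean, the unadjusted sample mean of design (b) falls below it, and the ratio of means in design (d) stays close to the coefficient used in the simulation.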



3 Causal models with design 



The formal definition of causal models with design relies on the definition of causal
models as presented by Pearl (2009) and the missing data concept presented by Rubin
(1976). The definition of causal models is extended to reflect the elements of inference:
the causal assumptions, the study design and the data. The immediate benefit is that
the methods of causal calculus are directly applicable for questions related to the study
design and estimation.



Causal structure and causal model are defined by Pearl (2009) as follows:



Definition 1 (Causal Structure, Pearl 2.2.1) A causal structure of a set of vari- 
ables V is a directed acyclic graph (DAG) in which each node corresponds to a distinct 
element of V , and each link represents a direct functional relationship among the cor- 
responding variables. 

Definition 2 (Causal Model, Pearl 2.2.2) A causal model is a pair $\mathcal{M} = \langle D, \Theta_D \rangle$
consisting of a causal structure $D$ and a set of parameters $\Theta_D$ compatible with $D$. The
parameters $\Theta_D$ assign a function $x_j = f_j(pa_j, u_j)$ to each $X_j \in V$ and a probability
measure $P(u_j)$ to each $u_j$, where $PA_j$ are the parents of $X_j$ in $D$ and where each $U_j$
is a random disturbance distributed according to $P(u_j)$ independently of all other $u$.

Causal model with design can be defined as an extension of the functional causal
model presented by Pearl where the notation for selection and missing data follows the
lines of Rubin (1976):

Definition 3 (Causal model with design) Causal model with design is a causal 
model that fulfills the following conditions: 

1. Each node in the causal structure is either a causal node, a selection node or 
a data node. Each node has an information type attribute with possible values: 
'observed', 'not observed', 'determined and known' and 'determined and unknown'. 

2. Each selection node represents a binary variable with the possible values 1 and 0.
If the selection node $M_2$ is a descendant of the selection node $M_1$, then $M_1 = 0$
implies $M_2 = 0$. There is always a unique selection node $M_\Omega$ (population node)
which is an ancestor of all selection nodes and has value $M_\Omega = 1$.

3. Each data node has two parents, one causal node and one selection node. A causal
node cannot be a parent for more than one data node. For a data node $X^*$ with
parents causal node $X$ and selection node $M$, it holds

\[
X^* = \begin{cases} X, & \text{if } M = 1, \\ \text{NA}, & \text{if } M = 0, \end{cases} \qquad (1)
\]

where NA represents a missing value.
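
The node types, the information type attribute and equation (1) can be written down as a small data structure. The following is only a sketch: the class and function names are hypothetical, `None` is used as a stand-in for NA, and only the third item of the definition is checked.

```python
from dataclasses import dataclass
from enum import Enum

NA = None  # stand-in for the special missing value 'NA'


class NodeType(Enum):
    CAUSAL = "causal"
    SELECTION = "selection"
    DATA = "data"


class InfoType(Enum):
    OBSERVED = "observed"
    NOT_OBSERVED = "not observed"
    DETERMINED_KNOWN = "determined and known"
    DETERMINED_UNKNOWN = "determined and unknown"


@dataclass(frozen=True)
class Node:
    name: str
    node_type: NodeType
    info_type: InfoType


def measure(x, m):
    """Value of a data node X* given its causal parent x and selection parent m."""
    if m not in (0, 1):
        raise ValueError("selection nodes are binary")
    return x if m == 1 else NA


def data_nodes_valid(nodes, parents):
    """Check item 3: one causal and one selection parent per data node,
    and no causal node feeding more than one data node."""
    used_causal = set()
    for name, node in nodes.items():
        if node.node_type is not NodeType.DATA:
            continue
        node_parents = parents.get(name, ())
        kinds = sorted(nodes[p].node_type.value for p in node_parents)
        if kinds != ["causal", "selection"]:
            return False
        causal_parent = next(p for p in node_parents
                             if nodes[p].node_type is NodeType.CAUSAL)
        if causal_parent in used_causal:
            return False
        used_causal.add(causal_parent)
    return True


# First-year part of the salary example (node names are illustrative).
nodes = {
    "Y1": Node("Y1", NodeType.CAUSAL, InfoType.NOT_OBSERVED),
    "M1": Node("M1", NodeType.SELECTION, InfoType.OBSERVED),
    "Y1*": Node("Y1*", NodeType.DATA, InfoType.OBSERVED),
}
parents = {"Y1*": ["Y1", "M1"]}
```

With these inputs `measure(3200, 1)` returns the salary itself, `measure(3200, 0)` returns the NA stand-in, and `data_nodes_valid(nodes, parents)` accepts the graph.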





In the first item of Definition 3, the node types are named and the possible values of the
information type attribute are listed. The information type attribute of the variable,
with the possible values 'observed', 'not observed', 'determined and known' and
'determined and unknown', describes the knowledge of the researcher. In visualizations
these types are presented as a filled circle, an open circle, a filled diamond and an open
diamond, respectively. In an observational setup, a causal variable $X$ is not observed as
such; only the corresponding measurement $X^*$ is observed. In an experimental setup,
the values of some causal variables can be determined by the researcher. Usually,
causal variables determined by the researcher are known but in principle they can
also be unknown if the information on the values set for the variable has been lost
after the execution of the experiment. The data are by definition always observed. A
selection variable can have all four information types. The value of a selection variable
is determined when sampling or other selection is applied to the population. The
selection variable can be determined and known as in Figure 1(a-c) or determined and
unknown as in Figure 1(d). When the missing data can be identified as an empty
record, the selection variable is observed. If the missing individuals are not identified
at all, as is the case in left truncation for instance, the selection variable is not
observed.

In the second item of Definition 3, the role of the population and the selection
variables is specified. Causal assumptions are always made with respect to some finite
population $\Omega$, known as the study source in epidemiology (Miettinen, 2011). There is
always only one population node. If there is more than one conceptual population, the
population $\Omega$ is defined as the union of the conceptual populations. The conceptual
population, for instance, a geographical area, becomes a causal variable in the model.
If the causal mechanisms differ by the area, the model contains arrows from the area
to the other causal nodes. This allows defining models where some causal relationships
are similar across the areas and some different. The selection probabilities for the
sampling often also differ by the area, which is shown in the model by an arrow from
the area to the selection node.

The members of the population can be a priori known or unknown. In the former
case, the researcher has a unique identifier, for instance, social security number,
available for each member of the population before the study. In the latter case, the
researcher identifies the members of the population only when they enter the study.
A selection node $M$ induces the subpopulation $\{i \in \Omega \mid M_i = 1\}$, which consists of the
selected individuals. The causal effects are typically estimated for the population $\Omega$
but, for instance, in epidemiological cohort studies the effects are often estimated only
for the cohort $\{i \in \Omega \mid M_i = 1\}$, also known as the study base (Miettinen, 2011).

In the third item of Definition 3, the relations of the causal variables, the selection
variables and the data are specified. The value of random variable $X_i$ is measured only
if the individual $i$ is selected to be measured, which is indicated by the selection variable
$M_i$. This means that the measured value $X_i^*$ is a random variable which depends on
the variables $X_i$ and $M_i$. The definition of a univariate random variable is extended
so that in addition to the real axis, a random variable may also have a special value 'NA'
which indicates missing data. With this definition, all elements of scientific inference
can be expressed as random variables and their causal relationships. If a data node or
a selection node has a directed path to a causal node, the measurement or the selection
has a causal effect on the underlying causal variable. This may be the case, for instance,
in health examination studies where the participation in the study may increase the
awareness of a healthy lifestyle and consequently also have an impact on the later
measurements of health indicators.

In a causal model, the causal effects define a partial ordering between the variables. 
In addition to this causal time, the time of observation can be linked to each variable 
in a causal model with design. Together the causal time and the observational time 
define the relative location of each node in a visualization where the causal time is
presented on the x-axis and the observational time on the y-axis. To make the visualization
more informative, the stages of the study can be used as labels for the y-axis as is
done in the examples of Sections 2 and 5.

Measurement error can be added to a causal model with design in a straightforward
way. In the simplest case, the measurement $X_{2i}^*$ is made on the causal variable
$X_{2i} = X_{1i} + U$ where $X_{1i}$ is the underlying causal variable without measurement error and $U$ is
the random measurement error corresponding to the random disturbance in the definition
of causal model (supplementary online text). In the case of correlated measurement
errors, an explicit causal variable $U$ has an effect on several causal variables with
measurement error.

4 Application of causal models with design



The causal models with design and their collapses defined below allow the direct ap- 
plication of causal calculus and the concept of ignorability. The following list shows 
how causal models with design can be used in various tasks of data analysis: 



Causal effects: The rules of causal calculus (Pearl, 2009) and concepts such as d-separation,
back-door criterion and front-door criterion allow the estimation of
the causal effects of interventions when the causal model with design is collapsed to an
ordinary causal model. Alternative concepts such as influence diagrams (Dawid,
2002; Dawid and Didelez, 2010) and moralization (Lauritzen et al., 1990) can
also be applied.

Missing data: The missing data mechanism is ignorable in likelihood based inference
for data missing at random (MAR) (Rubin, 1976). Let $M$ be the selection
variable for the measurement $Y^*$ of causal variable $Y$. If there is no
directed path from $Y$ to $M$, it holds $p(M \mid Y) = p(M)$, which is a sufficient
condition for the data being MAR. The results derived for the selection diagrams
(Didelez et al., 2010; Geneletti et al., 2009; Cooper, 2000; Bareinboim and Pearl,
2012; Pearl and Bareinboim, 2012) are also directly applicable after the causal
model with design is collapsed to a selection diagram.

Likelihood factorization: The likelihood factorized according to the causal model 
with design offers a natural starting point for the parameter estimation in both 
the frequentist and the Bayesian approach. The idea is to write first the full 
likelihood for the data, the design and the latent variables and then see which 
parts of the likelihood are not needed in the estimation of the parameters of
interest. Examples of the likelihood factorization are given in the Appendix.

Visualization and communication: Causal graphs with design remove the ambiguity
related to the common names of study designs such as retrospective study,
prospective study, cohort study, case-control study and two-stage study (Vandenbroucke,
1991; Knol et al., 2008). The process of the data collection can be seen from the
collapse to a study flow diagram or directly from the causal graph with design.
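
The graph-reading step of the missing data item above, checking whether a directed path leads from a causal variable to a selection variable, can be sketched as a simple graph search. The edge lists below are a reading of designs (a) and (b) from the description in Section 2 and are assumptions about the figure (in particular the edge from the sampling node to the second-year response); the function name is hypothetical. The sketch covers only this path check and does not implement d-separation or any other rule of causal calculus.

```python
def has_directed_path(children, source, target):
    """Return True if the DAG {node: [children]} has a directed path source -> target."""
    stack = [source]
    seen = set()
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child == target:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False


# Design (a): non-response depends only on the sampling.
design_a = {
    "m_omega": ["m1"],
    "m1": ["M1", "M2"],
    "Y1": ["Y2", "Y1*"],
    "Y2": ["Y2*"],
    "M1": ["Y1*"],
    "M2": ["Y2*"],
}

# Design (b): the salary level also affects the non-response.
design_b = {
    "m_omega": ["m1"],
    "m1": ["M1", "M2"],
    "Y1": ["Y2", "Y1*", "M1"],
    "Y2": ["Y2*", "M2"],
    "M1": ["Y1*"],
    "M2": ["Y2*"],
}

if __name__ == "__main__":
    print(has_directed_path(design_a, "Y1", "M1"))  # no path in design (a)
    print(has_directed_path(design_b, "Y1", "M1"))  # direct edge in design (b)
```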

A causal graph with design can be collapsed to an ordinary causal model, a selection
diagram or a study flow diagram. The collapse to an ordinary causal model is defined as the
removal of selection nodes and data nodes:

Definition 4 (Collapse to an Ordinary Causal Model) Causal model
$\mathcal{M}_0 = \langle D_0, \Theta_{D_0} \rangle$ is a collapse of causal model with design $\mathcal{M} = \langle D, \Theta_D \rangle$ if (i)
the set of nodes in $D_0$ consists of the causal nodes of $D$, (ii) there exists an edge from
node $X$ to node $Y$ in $D_0$ if and only if there exists an edge from $X$ to $Y$ in $D$, (iii)
the set of parameters $\Theta_{D_0}$ is a subset of $\Theta_D$ so that if the node $X$ belongs to $D_0$, the
function assigned for $X$ is the function assigned for $X$ in $D$ with a specification that
all selection nodes in $D$ have value 0.

The specification that all selection nodes have value 0, i.e. no data are collected, is 
needed as a precaution for situations where the selection or the data collection may 
have an effect to the causal variables. 
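
The graph part of this collapse, conditions (i) and (ii), amounts to filtering nodes and edges. The sketch below uses a hypothetical representation (a dictionary of node types and a set of edges) and an edge list for design (b) that is read from the description in Section 2; condition (iii) on the parameters is not modeled.

```python
def collapse_to_causal_graph(node_types, edges):
    """Keep the causal nodes and the edges between them.

    node_types maps a node name to "causal", "selection" or "data";
    edges is a set of (parent, child) pairs.
    """
    causal_nodes = {n for n, kind in node_types.items() if kind == "causal"}
    causal_edges = {(x, y) for x, y in edges
                    if x in causal_nodes and y in causal_nodes}
    return causal_nodes, causal_edges


# Design (b) of the salary example (an assumed reading of the figure).
node_types = {
    "Y1": "causal", "Y2": "causal",
    "m_omega": "selection", "m1": "selection",
    "M1": "selection", "M2": "selection",
    "Y1*": "data", "Y2*": "data",
}
edges = {
    ("m_omega", "m1"), ("m1", "M1"), ("m1", "M2"),
    ("Y1", "Y2"), ("Y1", "M1"), ("Y2", "M2"),
    ("Y1", "Y1*"), ("M1", "Y1*"), ("Y2", "Y2*"), ("M2", "Y2*"),
}

if __name__ == "__main__":
    print(collapse_to_causal_graph(node_types, edges))
```

For this input only the two salary nodes and the edge between them remain, which is the underlying causal model shared by all four designs.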

Selection diagram is a causal diagram augmented with selection nodes. The definition
given by Pearl and Bareinboim (2012) considers the selection from the viewpoint
of transportability across populations:

Definition 5 (Selection Diagram for Transportability) Let $\langle \mathcal{M}_1, \mathcal{M}_2 \rangle$ be a
pair of structural causal models (Definition 2) relative to domains $\langle \Pi_1, \Pi_2 \rangle$, sharing
a causal structure $D$. Pair $\langle \mathcal{M}_1, \mathcal{M}_2 \rangle$ is said to induce a selection diagram $H$ if $H$
is constructed as follows:

1. Every edge in $D$ is also an edge in $H$;

2. $H$ contains an extra edge $M_j \to X_j$ whenever there exists a discrepancy $f_{1j} \neq f_{2j}$
or $P_1(U_j) \neq P_2(U_j)$ between $\mathcal{M}_1$ and $\mathcal{M}_2$.



The alternative definition used by Didelez et al. (2010); Geneletti et al. (2009); Cooper
(2000); Bareinboim and Pearl (2012) considers selection bias due to sampling and is
formulated here as follows:

Definition 6 (Selection Diagram for Selection Bias) The selection diagram of a
causal model $\mathcal{M}_0 = \langle D_0, \Theta_{D_0} \rangle$ and a set $S$ of selection indicators is a causal model
$\mathcal{M} = \langle D, \Theta_D \rangle$ where (i) the causal structure $D$ has the nodes and edges of $D_0$ and
the nodes corresponding to the selection indicators $S$, (ii) if variable $X$ has a direct
causal effect on the value of the selection indicator $M \in S$, there is an edge from $X$ to
$M$ in $D$ and (iii) $\Theta_D$ contains the parameters $\Theta_{D_0}$ and the parameters describing the
functional relationship between a selection node $M \in S$ and its parents.

In Definition 5 the arrows point from selection nodes to causal nodes and in Definition 6
the arrows point from causal nodes to the selection nodes. The definitions can be
merged and generalized using causal models with design. As defined in Section 3, there
will be only one population and the conceptual populations (domains) are handled by
introducing a causal variable for them. Now the collapse of causal model with design
to a selection diagram can be defined as a collapse to an ordinary causal model and an
identification of selection nodes:

Definition 7 (Collapse to a Selection Diagram) Selection diagram $H$ is a collapse
of causal model with design $\mathcal{M} = \langle D, \Theta_D \rangle$ and a set $S \subset V$ of selection
nodes if (i) the set of nodes in $H$ consists of the nodes in $S$ and the causal nodes of $D$,
(ii) there exists an edge from node $X$ to node $Y$ in $H$ if and only if there exists an edge
from $X$ to $Y$ in $D$.



Study flow diagram is a description of the stages of the study in a graphical form: 

Definition 8 (Collapse to a Study Flow Diagram) Study flow diagram $K$ is a
collapse of causal model with design $\mathcal{M} = \langle D, \Theta_D \rangle$ if (i) the set of nodes in $K$
consists of the selection nodes and data nodes of $D$, (ii) there exists an edge from node
$X$ to node $Y$ in $K$ if and only if there exists a directed path from $X$ to $Y$ in $D$ without
any intermediate selection nodes or data nodes.
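
Condition (ii) of this collapse can be implemented as a graph search that starts from each selection or data node and stops as soon as another selection or data node is reached. The sketch below is one possible implementation under assumptions: the graph representation and function name are hypothetical, and the edge list for design (b) is read from the description in Section 2.

```python
def collapse_to_study_flow(node_types, edges):
    """Return the nodes and edges of the study flow diagram.

    An edge (X, Y) is kept when the graph has a directed path from X to Y
    whose intermediate nodes, if any, are all causal nodes.
    """
    design_nodes = {n for n, kind in node_types.items()
                    if kind in ("selection", "data")}
    children = {}
    for x, y in edges:
        children.setdefault(x, []).append(y)

    flow_edges = set()
    for start in design_nodes:
        stack = list(children.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in design_nodes:
                flow_edges.add((start, node))          # path ends here
            else:
                stack.extend(children.get(node, ()))   # walk through causal node
    return design_nodes, flow_edges


# Design (b) of the salary example (an assumed reading of the figure).
node_types = {
    "Y1": "causal", "Y2": "causal",
    "m_omega": "selection", "m1": "selection",
    "M1": "selection", "M2": "selection",
    "Y1*": "data", "Y2*": "data",
}
edges = {
    ("m_omega", "m1"), ("m1", "M1"), ("m1", "M2"),
    ("Y1", "Y2"), ("Y1", "M1"), ("Y2", "M2"),
    ("Y1", "Y1*"), ("M1", "Y1*"), ("Y2", "Y2*"), ("M2", "Y2*"),
}

if __name__ == "__main__":
    print(sorted(collapse_to_study_flow(node_types, edges)[1]))
```

In this example no selection or data node points to a causal node, so the flow diagram simply keeps the edges among the selection and data nodes. If participation affected a later causal variable, as in the health examination example of Section 3, the search would add the corresponding indirect edges.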

Additional information such as the number of observations at each stage of the study
can be added to the diagram as instructed e.g. in the STROBE statement (von Elm et al.,
2007).



5 Examples with complex study design 

The real test for the proposed concept is its practical usability in empirical research. 
The examples presented in this section aim to demonstrate how causal models with 
design can describe the essential features of complex experimental and observational 
studies in a precise and illustrative way. The examples are from medicine and epidemi- 
ology where complex study designs are commonly used. The likelihood functions are 
given in Appendix. 

Figure 2 illustrates a causal model with design for the two-stage case-cohort design
used in the MORGAM Project (Kulathinal et al., 2007; Evans et al., 2005). The
project aims to estimate the impact of classic and genetic risk factors on the risk of
cardiovascular diseases. Due to the cost of genotyping, genes can be measured only for
a subset of the cohort. The example demonstrates how left truncation, cohort sampling
and case-cohort sampling are shown in a causal model with design.

Figure 3 shows how an experimental setup can be described in a causal model
with design. The treatment of the clinical trial is a causal variable determined by
the researcher by means of randomization. The example also demonstrates the
compliance problem encountered in clinical trials: the actual treatment may differ
from the allocated treatment if the participant does not follow the instructions given.
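
The consequences of non-compliance for the intention-to-treat and per-protocol analyses described in the caption of Figure 3 can be illustrated with a small simulation. Everything in the sketch is a hypothetical choice for illustration: the compliance model, the effect size, the outcome model and the function names. The measured treatment is assumed to equal the actual treatment.

```python
import random


def simulate_trial(n=20000, effect=1.0, seed=7):
    """Return (allocated, actual, outcome) triples for n trial participants."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)              # baseline covariate X
        allocated = rng.random() < 0.5       # randomized allocation T'
        # Hypothetical compliance model: compliance is more likely for x > 0.
        complies = rng.random() < (0.9 if x > 0 else 0.5)
        actual = allocated and complies      # actual treatment T
        y = effect * actual + 0.5 * x + rng.gauss(0.0, 1.0)  # outcome Y
        rows.append((allocated, actual, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def itt_estimate(rows):
    """Compare outcomes by allocated treatment, using all participants."""
    treated = [y for allocated, _, y in rows if allocated]
    control = [y for allocated, _, y in rows if not allocated]
    return mean(treated) - mean(control)


def per_protocol_estimate(rows):
    """Compare outcomes among participants whose actual treatment equals the allocation."""
    treated = [y for allocated, actual, y in rows if allocated and actual]
    control = [y for allocated, actual, y in rows if not allocated and not actual]
    return mean(treated) - mean(control)


if __name__ == "__main__":
    rows = simulate_trial()
    print(f"intention-to-treat: {itt_estimate(rows):.3f}")
    print(f"per-protocol:       {per_protocol_estimate(rows):.3f}")
```

In this setup the intention-to-treat estimate is diluted towards zero by the non-compliers, while the per-protocol estimate exceeds the simulated effect because compliance depends on the covariate that also affects the outcome.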

Figure 4 illustrates a nested case-control design where there is a dependence structure
between the selection variables of the individuals in the sample (Saarela et al.,
2012). The graphical presentation drawn for individual $i$ uses index $j$ to refer to all
other individuals.

Figure 2: Causal model with design for the two-stage case-cohort design used in the
MORGAM Project (Kulathinal et al., 2007; Evans et al., 2005). The sampling frame
$\{i : M_{0i} = 1\}$ is conditioned on the health status $Y_{0i}$ (alive, 24-65 years old) at the
beginning of the study and this dependence must be taken into account when estimates
for the population $\{i : m_{\Omega i} = 1\}$ are required. At the first stage of the study, a random
sample $\{i : m_{1i} = 1\}$ is selected. Classic risk factors $X_i^*$ and current health status $Y_{0i}^*$
are measured at the beginning of the study for the cohort members $\{i : M_{1i} = 1\}$.
Blood samples taken at the baseline are frozen to be used later. After a follow-up
period of 10 years or more, the selection for the second stage is made on the basis of
the measurements $X_i^*$ and $Y_i^*$. All disease cases and an age-stratified random subset
of the cohort are selected to the case-cohort set $\{i : m_{2i} = 1\}$ for which genetic risk
factors $Z_i^*$ are measured. Non-response $M_{2i}$ occurs due to contaminated samples and
other technical reasons.

Figure 3 legend:
$Z_i$: screening variables
$X_i$: baseline variables
$Y_i$: outcome variables
$Z_i^*$: screening measurements
$X_i^*$: baseline measurements
$Y_i^*$: outcome measurements
$T_i'$: allocated treatment
$T_i$: actual treatment
$T_i^*$: measured treatment (compliance)
$m_{\Omega i}$: population indicator
$m_{1i}$: sampling indicator for screening
$m_{2i}$: sampling indicator for treatment allocation


Figure 3: Causal model with design for a clinical trial. A sample $m_{1i}$ is selected
for screening from the population $m_{\Omega i}$. The inclusion for the trial $m_{2i}$ is based on
the screening variable $Z_i^*$. At the baseline, covariate $X_i^*$ is measured for the trial
participants and a randomized decision on the treatment $T_i'$ is made. The actual
treatment $T_i$ during the treatment period may differ from the intended treatment $T_i'$
because of non-compliance. The outcome $Y_i$ depends on the covariate $X_i$ and the
treatment $T_i$. At the end of the treatment period measurements for the observed
outcome $Y_i^*$ and the observed treatment $T_i^*$ are made. In the intention-to-treat analysis,
the observed outcome $Y_i^*$ is explained by the intended treatment $T_i'$ using all included
participants $\{i : m_{2i} = 1\}$. In the per-protocol analysis, only the compliant participants
with $T_i' = T_i^*$ are included.

Figure 4 legend:
$X_i$: matching variables, e.g. age and sex
$Z_i$: risk factor of interest
$Y_i$: outcome variables
$X_i^*$: measured matching variables
$Z_i^*$: measured risk factor of interest
$Y_i^*$: measured outcome
$X_j^*$: measured matching variables for other individuals
$Y_j^*$: measured outcome variables for other individuals
$m_{\Omega i}$: population indicator
$m_{1i}$: sampling indicator for outcome measurement
$m_{2i}$: sampling indicator for risk factor measurement
$M_{2i}$: non-response for risk factor measurements


Figure 4: Causal model with design for a nested case-control study. The idea of the
case-control design is to select the individuals for the measurement of the expensive
risk factor $Z_i$ on the basis of the response $Y_i$ and the inexpensive risk factor $X_i$. At the
first stage, a sample $\{i : m_{1i} = 1\}$ is selected from the population $\{i : m_{\Omega i} = 1\}$ and
variables $X_i^*$ and $Y_i^*$ are measured. The selection of cases and controls $m_{2i}$ depends not
only on the measurements of individual $i$, $X_i^*$ and $Y_i^*$, but also on the response $Y_j^*$ and the
covariate $X_j^*$ of all other individuals in the sample $\{i : m_{1i} = 1\}$. Each individual $j$ has
a similar causal graph which has been omitted in the figure. The controls are selected
considering the individuals at risk at the time (age or calendar time) of the disease
event. A control may later become a case which creates a complicated dependence
structure between the selection probabilities. The non-response $M_{2i}$ reflects the fact that
the measurement $Z_i^*$ is not available for all individuals selected to the case-control set.

6 Discussion



Causal models with design offer a systematic and unifying view of scientific inference.
They present the causal assumptions, the study design and the data collection
in a way that accounts for the complexity encountered in real-world problems. The
examples in Section 5 demonstrate how the concept can be used to describe medical
studies with multiple stages.

Unlike earlier works, causal models with design present the population and
the selection as intrinsic parts of the model. Selection nodes may have both incoming 
and outgoing connections to other nodes. A distinction is made between a random 
variable and its measured value. Combined with the selection this allows the description 
of various sampling and missing data setups in terms of causal effects. 

The limitations of the causal model with design are in many ways similar to the 
limitations of the causal models in general. The presentation of causal assumptions 
in the form of a graphical model has the benefit that many problems can be solved 
without specifying the parameters of the model. On the other hand, the explicit para- 
metric definition of the functional relationships is still the only decisive presentation 
of the model. Certain causal effects may be identifiable only under specific parametric
assumptions, as was demonstrated in design (d) of Section 2. The presented concept
provides a natural way for the factorization of the likelihood but does not directly 
contribute to the estimation of causal effects. 

The implications of the concept are twofold. First, it ties together causality and
study design and opens new possibilities for the practical application of graphical mod- 
els. Second, it shows the key elements of the study in a compact visual format and thus 
increases the clarity and speed of communication. High standards of design, analysis 
and communication of scientific studies will significantly reduce the time and effort 
needed for the synthesis of scientific knowledge. 

Appendix: Likelihood factorizations

For the estimation of the causal effects the causal model with design must be specified 
in a parametric form. In this section, likelihood functions are presented for the 
examples of Section O. The likelihood functions are derived for the population 
$\{i : m_{\Omega i} = 1\}$ of size N, starting from the factorization that follows 
directly from the DAG. In the first step, the likelihood function is written assuming 
that all variables are observed for the whole population. The measurements are 
redundant in this case because they are deterministic functions of the causal 
variables and the selection variables. The measurements become explicit when the 
likelihood function is further factorized according to the selection variables. 
Finally, the likelihood of the observed data is obtained as an integral over the 
unknown causal variables. 
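
The last two steps can be illustrated with a minimal sketch. The design below is a toy one and not one of the examples of the paper: a binary Z is measured only for selected individuals, the binary outcome Y is measured for everyone, and selection depends on the measured Y. All names (`observed_loglik`, `p_z`, `p_y_given_z`, the layout of `psi`) are hypothetical and chosen only for illustration.

```python
import math


def bernoulli(p, x):
    """Probability mass of a Bernoulli(p) variable at x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def observed_loglik(data, theta, psi):
    """Observed-data log-likelihood for a toy design (illustrative assumption).

    Each record is (m, z_star, y_star):
      m      -- 1 if the individual was selected for measuring Z, else 0
      z_star -- measured value of Z (None when m == 0)
      y_star -- measured outcome Y (always observed in this toy design)
    theta holds the causal parameters, psi[y] is p(m = 1 | Y = y).
    """
    p_z = theta["p_z"]
    p_y_given_z = theta["p_y_given_z"]
    total = 0.0
    for m, z_star, y_star in data:
        # Selection part: p(m_i | Y_i = y*, psi)
        selection = bernoulli(psi[y_star], m)
        if m == 1:
            # Z is observed: plug in the measured value
            causal = bernoulli(p_z, z_star) * bernoulli(p_y_given_z[z_star], y_star)
        else:
            # Z is unobserved: integrate (here, sum) over its values
            causal = sum(
                bernoulli(p_z, z) * bernoulli(p_y_given_z[z], y_star)
                for z in (0, 1)
            )
        total += math.log(causal * selection)
    return total
```

The likelihood is split by the value of the selection variable, and the unobserved causal variable is summed out only in the group where it was not measured, which is the pattern followed by the factorizations in this appendix.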

Parameters θ define the distribution of the causal variables and parameters ψ 
define the distribution of the selection variables. A vectorized notation similar to 
X = (X_1, X_2, ..., X_N) is used for all variables and the distributions are defined with 
respect to the first argument unless otherwise specified. 

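The likelihood function for the MORGAM Project case-cohort design presented in 
Figure 2 has the form 

\[
\begin{aligned}
& p(m_\Omega, M_0, m_1, M_1, m_2, M_2, Z, X, Y_0, Y \mid \theta, \psi) \\
&= \prod_{i=1}^{N} p(m_{\Omega i})\, p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} \mid m_{\Omega i}, Y_{0i}, \psi)\, p(m_{1i} \mid M_{0i}, \psi)\, p(M_{1i} \mid m_{1i}, \psi) \\
&\qquad \times p(X_i \mid Z_i, \theta)\, p(Y_i \mid Z_i, Y_{0i}, X_i, \theta)\, p(m_{2i} \mid M_{1i}, Y_{0i}, Y_i, X_i, \psi)\, p(M_{2i} \mid m_{2i}, \psi) \\
&= \prod_{\{i : M_{2i} = 1\}} p_Z(Z_i^* \mid \theta)\, p_{Y_0}(Y_{0i}^* \mid Z_i = Z_i^*, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i} = Y_{0i}^*, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi) \\
&\qquad \times p(M_{1i} = 1 \mid m_{1i} = 1, \psi)\, p_X(X_i^* \mid Z_i = Z_i^*, \theta)\, p_Y(Y_i^* \mid Z_i = Z_i^*, Y_{0i} = Y_{0i}^*, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 1 \mid M_{1i} = 1, Y_{0i}^*, Y_i^*, X_i^*, \psi)\, p(M_{2i} = 1 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : M_{2i} = 0,\, m_{2i} = 1\}} p(Z_i \mid \theta)\, p_{Y_0}(Y_{0i}^* \mid Z_i, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i} = Y_{0i}^*, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi) \\
&\qquad \times p(M_{1i} = 1 \mid m_{1i} = 1, \psi)\, p_X(X_i^* \mid Z_i, \theta)\, p_Y(Y_i^* \mid Z_i, Y_{0i} = Y_{0i}^*, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 1 \mid M_{1i} = 1, Y_{0i}^*, Y_i^*, X_i^*, \psi)\, p(M_{2i} = 0 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : m_{2i} = 0,\, M_{1i} = 1\}} p(Z_i \mid \theta)\, p_{Y_0}(Y_{0i}^* \mid Z_i, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i} = Y_{0i}^*, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi) \\
&\qquad \times p(M_{1i} = 1 \mid m_{1i} = 1, \psi)\, p_X(X_i^* \mid Z_i, \theta)\, p_Y(Y_i^* \mid Z_i, Y_{0i} = Y_{0i}^*, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 0 \mid M_{1i} = 1, Y_{0i}^*, Y_i^*, X_i^*, \psi) \\
&\quad \times \prod_{\{i : M_{1i} = 0,\, m_{1i} = 1\}} p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i}, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi) \\
&\qquad \times p(M_{1i} = 0 \mid m_{1i} = 1, \psi)\, p(X_i \mid Z_i, \theta)\, p(Y_i \mid Z_i, Y_{0i}, X_i, \theta) \\
&\quad \times \prod_{\{i : m_{1i} = 0,\, M_{0i} = 1\}} p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i}, \psi)\, p(m_{1i} = 0 \mid M_{0i} = 1, \psi) \\
&\qquad \times p(X_i \mid Z_i, \theta)\, p(Y_i \mid Z_i, Y_{0i}, X_i, \theta) \\
&\quad \times \prod_{\{i : M_{0i} = 0\}} p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} = 0 \mid m_{\Omega i}, Y_{0i}, \psi)\, p(X_i \mid Z_i, \theta)\, p(Y_i \mid Z_i, Y_{0i}, X_i, \theta).
\end{aligned}
\]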

The likelihood of the observed data is obtained as an integral over the unknown 
variables Z, X, Y_0 and Y: 



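\[
\begin{aligned}
& p(m_\Omega, M_0, m_1, M_1, m_2, M_2, Z^*, X^*, Y_0^*, Y^* \mid \theta, \psi) \\
&= \prod_{\{i : M_{2i} = 1\}} p_Z(Z_i^* \mid \theta)\, p_{Y_0}(Y_{0i}^* \mid Z_i = Z_i^*, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i} = Y_{0i}^*, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi) \\
&\qquad \times p(M_{1i} = 1 \mid m_{1i} = 1, \psi)\, p_X(X_i^* \mid Z_i = Z_i^*, \theta)\, p_Y(Y_i^* \mid Z_i = Z_i^*, Y_{0i} = Y_{0i}^*, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 1 \mid M_{1i} = 1, Y_{0i}^*, Y_i^*, X_i^*, \psi)\, p(M_{2i} = 1 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : M_{2i} = 0,\, m_{2i} = 1\}} \int p(Z_i \mid \theta)\, p_{Y_0}(Y_{0i}^* \mid Z_i, \theta)\, p_X(X_i^* \mid Z_i, \theta)\, p_Y(Y_i^* \mid Z_i, Y_{0i} = Y_{0i}^*, X_i = X_i^*, \theta)\, dZ_i \\
&\qquad \times p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i} = Y_{0i}^*, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi)\, p(M_{1i} = 1 \mid m_{1i} = 1, \psi) \\
&\qquad \times p(m_{2i} = 1 \mid M_{1i} = 1, Y_{0i}^*, Y_i^*, X_i^*, \psi)\, p(M_{2i} = 0 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : m_{2i} = 0,\, M_{1i} = 1\}} \int p(Z_i \mid \theta)\, p_{Y_0}(Y_{0i}^* \mid Z_i, \theta)\, p_X(X_i^* \mid Z_i, \theta)\, p_Y(Y_i^* \mid Z_i, Y_{0i} = Y_{0i}^*, X_i = X_i^*, \theta)\, dZ_i \\
&\qquad \times p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i} = Y_{0i}^*, \psi)\, p(m_{1i} = 1 \mid M_{0i} = 1, \psi)\, p(M_{1i} = 1 \mid m_{1i} = 1, \psi) \\
&\qquad \times p(m_{2i} = 0 \mid M_{1i} = 1, Y_{0i}^*, Y_i^*, X_i^*, \psi) \\
&\quad \times \prod_{\{i : M_{1i} = 0,\, m_{1i} = 1\}} \int\!\!\int\!\!\int\!\!\int p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i}, \psi)\, p(X_i \mid Z_i, \theta) \\
&\qquad \times p(Y_i \mid Z_i, Y_{0i}, X_i, \theta)\, dZ_i\, dX_i\, dY_{0i}\, dY_i\; p(m_{1i} = 1 \mid M_{0i} = 1, \psi)\, p(M_{1i} = 0 \mid m_{1i} = 1, \psi) \\
&\quad \times \prod_{\{i : m_{1i} = 0,\, M_{0i} = 1\}} \int\!\!\int\!\!\int\!\!\int p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} = 1 \mid m_{\Omega i}, Y_{0i}, \psi)\, p(X_i \mid Z_i, \theta) \\
&\qquad \times p(Y_i \mid Z_i, Y_{0i}, X_i, \theta)\, dZ_i\, dX_i\, dY_{0i}\, dY_i\; p(m_{1i} = 0 \mid M_{0i} = 1, \psi) \\
&\quad \times \prod_{\{i : M_{0i} = 0\}} \int\!\!\int\!\!\int\!\!\int p(Z_i \mid \theta)\, p(Y_{0i} \mid Z_i, \theta)\, p(M_{0i} = 0 \mid m_{\Omega i}, Y_{0i}, \psi)\, p(X_i \mid Z_i, \theta) \\
&\qquad \times p(Y_i \mid Z_i, Y_{0i}, X_i, \theta)\, dZ_i\, dX_i\, dY_{0i}\, dY_i.
\end{aligned}
\]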



The analysis of the MORGAM data can be approached using a Bayesian approach 
(Kulathinal and Arjas, 2006), conditional likelihood (Saarela et al., 2009) or 
nonparametric imputation (Karvanen et al., 2010). 

The likelihood function for the clinical trial presented in Figure E] has the form 
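
\[
\begin{aligned}
& p(m_\Omega, m_1, m_2, Z, X, Y, T', T \mid \theta, \psi) \\
&= \prod_{i=1}^{N} p(m_{\Omega i})\, p(m_{1i} \mid m_{\Omega i}, \psi)\, p(Z_i \mid \theta)\, p(m_{2i} \mid m_{1i}, Z_i, \psi)\, p(X_i \mid \theta) \\
&\qquad \times p(T'_i \mid m_{2i}, \psi)\, p(T_i \mid T'_i, \theta)\, p(Y_i \mid X_i, T_i, \theta) \\
&= \prod_{\{i : m_{2i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p_Z(Z_i^* \mid \theta)\, p(m_{2i} = 1 \mid m_{1i} = 1, Z_i^*, \psi) \\
&\qquad \times p_X(X_i^* \mid \theta)\, p(T'_i \mid m_{2i} = 1, \psi)\, p_T(T_i^* \mid T'_i, \theta)\, p_Y(Y_i^* \mid X_i = X_i^*, T_i = T_i^*, \theta) \\
&\quad \times \prod_{\{i : m_{2i} = 0,\, m_{1i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p_Z(Z_i^* \mid \theta)\, p(m_{2i} = 0 \mid Z_i^*, \psi)\, p_X(X_i \mid \theta) \\
&\qquad \times p_T(T_i \mid \theta)\, p_Y(Y_i \mid X_i, T_i, \theta) \\
&\quad \times \prod_{\{i : m_{1i} = 0\}} p(m_{1i} = 0 \mid m_{\Omega i}, \psi)\, p_Z(Z_i \mid \theta)\, p_X(X_i \mid \theta)\, p_T(T_i \mid \theta)\, p_Y(Y_i \mid X_i, T_i, \theta).
\end{aligned}
\]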

The likelihood of the observed data is obtained as an integral over the unknown vari- 
ables Z, X, T and Y: 

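\[
\begin{aligned}
& p(m_\Omega, m_1, m_2, T', Z^*, Y^*, X^*, T^* \mid \theta, \psi) \\
&= \prod_{\{i : m_{2i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p_Z(Z_i^* \mid \theta)\, p(m_{2i} = 1 \mid m_{1i} = 1, Z_i^*, \psi)\, p_X(X_i^* \mid \theta) \\
&\qquad \times p(T'_i \mid m_{2i} = 1, \psi)\, p_T(T_i^* \mid T'_i, \theta)\, p_Y(Y_i^* \mid X_i = X_i^*, T_i = T_i^*, \theta) \\
&\quad \times \prod_{\{i : m_{2i} = 0,\, m_{1i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p_Z(Z_i^* \mid \theta)\, p(m_{2i} = 0 \mid Z_i^*, \psi) \\
&\quad \times \prod_{\{i : m_{1i} = 0\}} p(m_{1i} = 0 \mid m_{\Omega i}, \psi).
\end{aligned}
\]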

Only the first part of the likelihood is needed to estimate the effect of the treatment 
on the response. 
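
One way to read this statement is sketched here under an added assumption that is not made explicit in the text: θ consists of separate components θ_Z, θ_X, θ_T and θ_Y, one for each conditional distribution. The response distribution then enters the observed-data likelihood only through the product over the randomized individuals, so that, as a function of the hypothetical component θ_Y,

\[
L(\theta_Y) \propto \prod_{\{i : m_{2i} = 1\}} p_Y(Y_i^* \mid X_i = X_i^*, T_i = T_i^*, \theta_Y).
\]

The remaining factors involve only ψ and the other components of θ and act as constants when this expression is maximized with respect to θ_Y.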

The likelihood for the nested case-control study presented in Figure H] has the form 
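
\[
\begin{aligned}
& p(m_\Omega, m_1, m_2, Z, X, Y \mid \theta, \psi) \\
&= \prod_{i=1}^{N} p(m_{\Omega i})\, p(m_{1i} \mid m_{\Omega i}, \psi)\, p(Z_i \mid \theta)\, p(X_i \mid \theta)\, p(Y_i \mid Z_i, X_i, \theta) \\
&\qquad \times p(m_{2i} \mid m_{1i}, X_i, Y_i, X, Y, \psi)\, p(M_{2i} \mid m_{2i}, \psi) \\
&= \prod_{\{i : M_{2i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p_Z(Z_i^* \mid \theta)\, p_X(X_i^* \mid \theta)\, p_Y(Y_i^* \mid Z_i = Z_i^*, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 1 \mid m_{1i} = 1, X_i^*, Y_i^*, X^*, Y^*, \psi)\, p(M_{2i} = 1 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : M_{2i} = 0,\, m_{2i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p(Z_i \mid \theta)\, p_X(X_i^* \mid \theta)\, p_Y(Y_i^* \mid Z_i, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 1 \mid m_{1i} = 1, X_i^*, Y_i^*, X^*, Y^*, \psi)\, p(M_{2i} = 0 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : m_{2i} = 0,\, m_{1i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p(Z_i \mid \theta)\, p_X(X_i^* \mid \theta)\, p_Y(Y_i^* \mid Z_i, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 0 \mid m_{1i} = 1, X_i^*, Y_i^*, X^*, Y^*, \psi) \\
&\quad \times \prod_{\{i : m_{1i} = 0\}} p(m_{1i} = 0 \mid m_{\Omega i}, \psi)\, p(Z_i \mid \theta)\, p(X_i \mid \theta)\, p(Y_i \mid Z_i, X_i, \theta).
\end{aligned}
\]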

The likelihood of the observed data is obtained as an integral over the unknown 
variables Z, X and Y: 

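\[
\begin{aligned}
& p(m_\Omega, m_1, m_2, Z^*, Y^*, X^* \mid \theta, \psi) \\
&= \prod_{\{i : M_{2i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi)\, p_Z(Z_i^* \mid \theta)\, p_X(X_i^* \mid \theta)\, p_Y(Y_i^* \mid Z_i = Z_i^*, X_i = X_i^*, \theta) \\
&\qquad \times p(m_{2i} = 1 \mid m_{1i} = 1, X_i^*, Y_i^*, X^*, Y^*, \psi)\, p(M_{2i} = 1 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : M_{2i} = 0,\, m_{2i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi) \int p(Z_i \mid \theta)\, p_X(X_i^* \mid \theta)\, p_Y(Y_i^* \mid Z_i, X_i = X_i^*, \theta)\, dZ_i \\
&\qquad \times p(m_{2i} = 1 \mid m_{1i} = 1, X_i^*, Y_i^*, X^*, Y^*, \psi)\, p(M_{2i} = 0 \mid m_{2i} = 1, \psi) \\
&\quad \times \prod_{\{i : m_{2i} = 0,\, m_{1i} = 1\}} p(m_{1i} = 1 \mid m_{\Omega i}, \psi) \int p(Z_i \mid \theta)\, p_X(X_i^* \mid \theta)\, p_Y(Y_i^* \mid Z_i, X_i = X_i^*, \theta)\, dZ_i \\
&\qquad \times p(m_{2i} = 0 \mid m_{1i} = 1, X_i^*, Y_i^*, X^*, Y^*, \psi) \\
&\quad \times \prod_{\{i : m_{1i} = 0\}} p(m_{1i} = 0 \mid m_{\Omega i}, \psi).
\end{aligned}
\]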